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Abstract: We propose a Bayesian nonparametric model to infer population admixture, extending 
the Hierarchical Dirichlet Process (HDP, Teh et al. 2006) to allow for correlation between loci due to 
Linkage Disequilibrium. Given multilocus genotype data from a sample of individuals, the model allows 
inferring classifying individuals as unadmixed or admixed, inferring the number of subpopulations 
ancestral to an admixed population and the population of origin of chromosomal regions. Our model 
does not assume any specific mutation process and can be applied to most of the commonly used genetic 
markers. We present a MCMC algorithm to perform posterior inference from the model and discuss 
methods to summarise the MCMC output for the analysis of population admixture. We demonstrate 
the performance of the proposed model in simulations and in a real application, using genetic data 
from the EDAR gene, which is considered to be ancestry-informative due to well-known variations in 
allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset 
leads to the identification of a rare haplotype in Europeans. 
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1. Introduction 

Population stratification, or population structure, refers to the presence of a systematic difference in genetic 
markers’ allele frequencies between subpopulations due to variation in ancestry. This phenomenon arises 
from the bio-geographical distribution of human populations. The analysis of population structure presents 
an important problem in population genetics and arises in many contexts. It is central to the understanding of 
human migratory history and the genesis of modern populations [49, 47]. The associated admixture analysis 
of individuals is important in correcting the confounding effects of population ancestry on gene mapping [60] 
and association studies [42]. It is also useful in the analysis of gene flow in hybridization zones [16] and invasive 
species [46] , conservation genetics [59] and domestication events [35]. The establishment of inexpensive single 
nucleotide polymorphism (SNP) genotyping platforms in recent years has facilitated collection of markers 
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to assess genetic ancestry in human populations and in general to investigate genetic relationships in living 
organisms. This paper focuses on a particular form of population structure: admixture. Genetic admixtures 
occur when two or more previously isolated populations begin interbreeding, resulting in the introduction of 
new genetic lineages into a population (e.g. the African-American population). 

Broadly speaking, the aim of population structure analysis based on genetic data are: (i) detection of 
population structure; (ii) estimating the number of subpopulations in a sample; (iii) assigning individual 
in subpopulations; (iv) defining the number of ancestral populations in admixed populations; (v) inferring 
ancestral population proportions to admixed individuals; (vi) identifying the genetic ancestry of distinct 
chromosomal segments within an individual. A variety of modelling approaches have been proposed in the 
literature. Two of the most widely used approaches are Principal Component Analysis (PCA) and model- 
based estimation of ancestry, mainly involving clustering techniques or hidden Markov models. The PCA 
approach has been used to infer population structure for several decades. In the PCA approach, the indi¬ 
viduals’ genotypes are projected onto a lower dimensional space so that the locations of individuals in the 
projected space reflects their genetic similarities [38, 34]. It should be noted that the top principal com¬ 
ponents do not always capture population structure but may reflect family relatedness, long range linkage 
disequilibrium or simply genotyping artefacts. Model based methods aim to reconstruct historical events and 
therefore to infer explicitly genetic ancestry (e.g. [43, 55, 2]). In the structured association approach samples 
are assigned to sub-population clusters, possibly allowing for fractional membership. 

An influential early approach is STRUCTURE by Pritchard et al. (2000) [43], which assumes that indi¬ 
viduals come from one of K sub-populations. Based on Bayesian mixture models, population membership 
and population specific allele frequencies are jointly estimated from the data. This simple framework can 
be extended to genetic admixtures, allowing individuals to have ancestry from more than one population. 
For each individual, STRUCTURE determines what proportion of the individuals’ genome comes from each 
population, while the alleles at different loci are modelled conditionally independently given these admixture 
proportions. Taking a Bayesian approach to inference, independent priors on the allelic profile parameters 
of each population are specified and posterior inference is performed through Markov chain Monte Carlo 
(MCMC). Various extensions to STRUCTURE have been proposed to address a number of shortcomings. 

One problem with STRUCTURE, which we address in this paper , is that of admixture linkage disequilib¬ 
rium among neighbouring loci. When individuals from different groups admix, their offsprings’ DNA become 
a mixture of the DNA from each admixing group. Chunks of DNA are passed along through subsequent 
generations, up to the present day. Therefore, the genomes of the descendants contain segments of DNA 
inherited from each of the original populations. The shorter the distance between two loci, the higher the 
probability that the population of ancestry will be the same at these two loci. This means that ancestry states 
are autocorrelated. The lengths of uninterrupted DNA segments inherited from each sub-population reflect 
how long ago the admixture event occurred. In general long uninterrupted segments from each population 
imply a recent admixture event. The original version of STRUCTURE did not deal with admixture linkage 
disequilibrium and as a result it is necessary to thin out tightly-linked loci to reduce correlations which can 
affect the quality of inference. In [13], Falush et al. (2003) improved on this issue by introducing a module to 
model linkage locally among neighbouring loci, using a Markov model which segments each chromosome into 
contiguous regions with shared genetic ancestry. This allows local genetic ancestry from genotype data to be 
inferred, as opposed to the global admixture proportions in [43]. Such local ancestry estimation gives more 
fine-grained information about the admixture process. In [43] and [13], each population is characterised by an 
allele frequency profile and they assume that the alleles at each locus are independent. Further dependence 
can be modelled by introducing dependence among allelic profiles due to common ancestry [13] or by using a 
Markov model for the alleles conditionally on the ancestral segmentation [55]. Another class of models uses 
haplotypic profiles as richer representations for the allelic dependence within populations [21]. 

Another important statistical concern in admixture modelling, which we address in this paper, is the 
determination of the number of ancestral populations. In Pritchard et al. (2000) [43], this is achieved using 
a model selection criteria based on MCMC estimates of the log marginal probabilities of the data and the 
Bayesian deviance information criterion, though it has been noted by [13] that such estimates are highly 
sensitive to prior specifications regarding the relatedness of populations. See also Corander et al. (2003) [8] 
and Evanno et al. (2005) [12] for other parametric approaches to determining the number of populations. 
One way in which such model selection can be sidestepped is by using Bayesian nonparametric models [22], 
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which offer a flexible framework and do not require the specification of a fixed and finite model size (which, 
in our context, is the number of populations). Rather, one assumes an unbounded potential model size, of 
which only a finite part is observed on a given finite dataset. Such ideas have been applied to population 
structure inference by Huelsenbeck and Andolfatto [25] who used the Dirichlet Process (DP) to define a 
Bayesian nonparametric counterpart to the “no-admixture” model of Pritchard et al. (2000) [43] (see also 
Dawson and Belkhir [9] and Pella and Masuda [39] for extensions to polyploid data), and by Sohn et al. 
(2012) [53] who used infinite hidden Markov models [6] for modelling linkage disequilibrium in admixture 
modelling. 

In this paper, we propose a method for modelling population structure that simultaneously gives estimates 
of local ancestries, bypasses difficult model selection issues using Bayesian nonparametrics, and is designed to 
be computationally more scalable than current Bayesian nonparametric methods. Our approach is based on 
using the hierarchical Dirichlet process (HDP) [56] to non-parametrically model the unknown and uncertain 
number of populations without having to perform costly model selection. Unlike [53], we use a simplified 
transition model in which, during a transition event, the founder identities on either side of the transition 
are independent. This transition model requires a linear rather than quadratic (in the number of founders) 
number of parameters, as well as a forward-backward algorithm which scales linearly in the number of extant 
populations while introducing less auxiliary variables which can slow down convergence. 

In Section 2 we introduce our Bayesian nonparametric model, as well as a novel Markov chain Monte 
Carlo method which allows efficient exact inference using a retrospective slice sampling truncation scheme. 
Section 3 describes results of simulation studies, and Section 4 describes population structure analyses of 
genotype data from the EDAR gene region. Section 5 closes with a discussion of our findings as well as 
potential future work. 

2. Method 

We assume we have multilocus genotype data from a sample of admixed individuals arising from a number 
of populations. For simplicity, we suppose that there are N haploid individuals genotyped at L loci, and we 
denote by X = (xu)i<i<N,i<l<L the observed data, where Xu is the allele of individual i at locus l. 

2.1. Model specification 

Let K be the number of ancestral populations. We denote by Qi = ( qik)i<k<K the vector of admixture 
proportions of individual i, where qik denotes the proportion of the genome of individual i which can be 
traced to population k. While previous works used a finite value for K, we will take a Bayesian nonparametric 
approach and let K —> oo, so that there is an unbounded number of potential populations in the model. To 
account for dependence among loci, we use the model of linkage proposed by Falush et al. (2003) [13]. This 
employs a hidden Markov model which splits the genome into contiguous chunks with common ancestry. The 
model is parameterised by: dj, the genetic distance between locus l and locus l + 1, for each l = 1,..., L — 1, 
and r, the rate at which splits occur. Let Zu be a variable which denotes the population ancestry at locus l 
of individual i, and su be a binary variable which denotes whether locus l and locus l + 1 are in the same 
chunk (su = 1) or not (su = 0). The variables Su can be thought of as linkage indicators. The transition 
model is as follows: 


Zu ~ Discrete(Qi) 

Si t i .|-i ~ Bernoulli(e -rdi ), l = 1,.. .,L — 1 


+ ,/+ 1 + l ; Zil 


= Zu 

~ Discrete(Qi) 


If + ; 

if = 0. 


( 2 . 1 ) 


The probability of a split between loci l and 1+1 is 1— e ~ rdl , and the ancestral populations of each chromosome 
segment are independent and identically distributed, with probability q ^ for the ancestral population to be 
k for each chunk. The admixture model of Prichard et al. (2000) [43] can be recovered asr-> oo, as all loci 
become independent and the chunks consist of a single locus. 
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The model is completed by specifying the likelihood function for the observed alleles. We will assume that 
within each population Hardy-Weinberg equilibrium holds, and we can model the the allelic profile of the 
fcth population simply by specifying the vector of allele frequencies, that is 9k = (9kia)i<i<L,i<a<A, where 
Okia is the probability for allele a at locus l in population k. That is, 

xu\zu = k ~ Discrete (0 k i) (2.2) 

where 9ki = ( 9kia)i<a<A ■ For example, in the case of single nucleotide polymorphism (SNP) data, Xu are 
binary valued and modelled using Bernoulli distributions with means given by Oku- 

2.2. Prior specification and Bayesian nonparametrics 

We use a Dirichlet prior for the admixture proportions Qt. The typical prior in previous works [5, 45, 43, 
13] is given by a symmetric Dirichlet distribution, which assumes that all populations have a priori equal 
contribution to the observed genomes. We will use an asymmetric Dirichlet with mean Qq = (qok)i<k<K 
instead, to capture the assumption that some populations may be more prevalent than others, so has a priori 
higher chance of contributing more genetic material to each individual (see also Anderson [3] and Anderson 
and Thompson [4]): 


Qi\Qo ~ Dirichlet (aQ 0 ) 


(2.3) 


where a > 0 is a parameter which controls the concentration of the Dirichlet prior around its mean Q o. 

The asymmetric Dirichlet also allows for a Bayesian nonparametric model, in which the number of popu¬ 
lations K is taken to be infinite, while the corresponding infinite K limit does not lead to a mathematically 
well-defined model for the symmetric Dirichlet. Specifically, consider a hierarchical prior on Q o expressed as 
the so-called stick-breaking representation [52], 

For j = 1,2,...: v 0 j ~ Beta(l, a 0 ), (2.4) 

3 ~ 1 

q 0j = v 0j (1 v 0j f) 
j'=i 

where a o is a hyperparameter which controls the overall diversity of populations, with larger ao corresponding 
to a larger number of populations with more uniform proportions. The conditional distribution of Qi given 
Qo is still a Dirichlet as given in (2.3), though we need to extend the definition to one for infinite-dimensional 
vectors. Specifically, a constructive definition of such an infinite-dimensional Dirichlet is given as follows: 


For j = 1,2,...: 


3 

Vij ~ Beta(au(y, a(l — E «<y')) (2-5) 

T=1 

3- 1 

Qij = v n n 1 %')• 
j '=i 


While our model assumes theoretically the existence of an infinite number of populations, given a particular 
finite-sized dataset, only a finite (but random) number of populations will be used to model the data, and 
the posterior distribution over this number can be used to estimate the number of populations exhibited in 
the data; see Subsection 2.4.1 for details. 

The model is completed by specifying a prior on a, ao, r and Oki, k = 1,..., K; l = 1,..., L. For each 
population fc, we use independent Dirichlet for the allele frequencies at each locus. In the case of SNP data, 
this implies assuming independent Beta prior for each locus in each subpopulation. In our simulations and 
our application to the EDAR data, we take 9ui ~ Beta(c/q, c(l — Pi)), where mi denotes the prior mean for 
the allele frequency, assumed to be the same for all ancestral populations and c is a concentration parameter. 
We choose independent Gamma priors for a and «o for computational reasons. We specify a uniform prior 
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on logr, on a fairly large interval. Recall that di denotes the genetic distance between adjacent markers. If 
it is measured in morgans, then r can be interpreted as an estimate of t, the number of generations since 
the admixture event [13]. When the genetic distance between loci is not available, we can use as a proxy the 
physical distance measured in nucleotides. In this case r be interpreted as an estimate of the product of t 
and the recombination rate (expected number of crossovers per base pair per meiosis). 

Another important issue which arises is the computational requirements for inference in a model with an 
infinite number of populations. In this regard, a range of recent truncation and marginalisation techniques 
can be applied allowing for exact inference using finite computational resources [32, 58, 37, 14]. We propose 
a particular approach in Subsection 2.4, after discussing in the next subsection the theoretical motivation 
for the hierarchical Dirichlet prior described in (2.3) and (2.4). 

2.3. Hierarchical Dirichlet processes 

The stick-breaking prior for the overall population prevalences (2.4) imposes a particular ordering on the 
populations, in which populations with higher index have a priori lower prevalences. This is undesirable from 
a modelling perspective as the induced ordering is artificial, while from a computational perspective it is also 
undesirable as it introduces a label switching problem into the inference, which can slow down convergence of 
inference algorithms [26, 37]. In this section we address this issue by developing a more abstract formalism for 
the model based on a construction of coupled random probability measures called the hierarchical Dirichlet 
process [56] (see also Teh and Jordan [57] for a more recent review). 

Let (0,fl) be a measurable space. The Dirichlet process Go ~ DP(ao>-ff) is a random probability mea¬ 
sure over (0,f2) with the property that for any measurable partition (A\, ... ,Al) of 0 the random prob¬ 
ability vector (Go(Ai),..., Gq(Al)) is distributed according to a Dirichlet distribution with parameters 
(aoH(Ai), ... ,ao H(Al)) [15]. The parameters of the process consist of a positive concentration parameter 
ao and a base probability measure H over (0,D). A variety of more constructive representations exist for 
the Dirichlet process, and the reader is referred to Ghoshal [19] for a review on the DP and to Lijoi and 
Priinster [29] for a review of nonparametric prior distributions generalizing the DP. 

One of the noteworthy properties of the Dirichlet process is that the random probability measure Go is 
discrete almost surely, and can be written in the form 

OO 

Go = ^2 1okSg k . (2.6) 

fc=l 

The atoms (6k)k >l are independent and identically distributed according to the base probability measure 
H 1 while the atom masses are independent of the atoms, and have distribution given by the stick-breaking 
representation (2.4) [52]. 

In the context of admixture modelling, we will suppose that each atom in Go corresponds to a population 
with allelic frequencies parameterised by the atom, while the masses correspond to the population proportions 
or prevalences. In other words, 9k denotes the vector of the population specific allele frequencies for the 
L loci under investigation. As each individual has its own population proportions while the collection of 
populations are shared across individuals, we can model this using the hierarchical Dirichlet process (HDP). 
For each individual i, let Gi be an individual-specific atomic random probability measure. These measures 
are conditionally independent and identically distributed given a common base probability measure Go'- 

G i |Go~DP(a,G 0 ) (2.7) 

Since each atom in Gi is drawn from Go, the collection of atoms in Gi is precisely those in Go, while each 
Gi has its own specific atom masses: 

OO 

G t = J2 9ikSe k ( 2 - 8 ) 

k= 1 

where the masses ((]ik)k>i have distribution as given in (2.5). The HDP allows sharing of the ancestral among 
the indivual distributions as the Gi place atoms at the same discrete locations determined by Go (see Teh 
et al. (2006) [56] for details). 
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We refer to the proposed model as HDPStructure. In summary, Gi describes the proportion of the alleles 
on Xi = (xn ,... ,Xil) coming from each of the populations, as well as the parameters of the populations. 
We model the sequence Xi given Gi as follows: (i) first we place segment boundaries according to an non- 
honrogeneous Poisson process with rate rdi, (ii) then the alleles on each segment are generated by picking a 
population of origin according to Gi , then sampling the alleles according to the population distribution. We 
have expressed the hierarchical prior over the population proportions (2.4), (2.5) as the joint distribution 
of atom masses in a HDP, while the atoms correspond to the population parameters. Further, while the 
stick-breaking representation imposes a particular ordering among the atoms, there is no ordering of atoms 
in the representation as random probability measures themselves. As we will see next, this allows for an 
efficient Markov chain Monte Carlo algorithm for posterior simulation. 

2.4- Markov Chain Monte Carlo 

In this section we describe a Markov chain Monte Carlo (MCMC) algorithm for posterior simulation in the 
HDPStructure model. The MCMC sampling algorithm iterates between updates to the random probability 
measures (Gi)o<i<jv, the latent state sequences (su, Zu)i<i<N,i<l<L, and the model parameters in turn, 
each update conditional upon all the other variables in the model. Updates to the random probability 
measures make use of another representation of the HDP called the Chinese restaurant franchise [56] , as well 
as a retrospective slice sampling technique which allows for a finite truncation to the random probability 
measures while retaining exactness of the procedure. Updates to the latent state sequences make use of 
an efficient forward filtering-backward sampling procedure as a Metropolis-Hastings proposal distribution. 
Finally, updates to model parameters are straightforward one-dinrensional Metropolis-Hastings updates. 
Detailed descriptions of these updates are included in the Appendix. MATLAB software implementing this 
MCMC scheme is freely available at http://BigBayes.github.io/HDPStructure. 

2-4-1. Updates to random probability measures 

Conditioned on the model parameters and latent state sequences, the update to the random probability 
measures (Gi)o<i<jv follow standard results for the hierarchical Dirichlet process [56]. As noted previously, 
since the data is finite, the number of populations used to model the data is finite as well. Conditioned 
on the latent state sequences {zu}, suppose the number of such populations (as a function of the latent 
state sequences) is K*. For simplicity, we may index these populations as 1,..., K*. The random probability 
measures can be expressed as: 

k * k * 

Go = qokde k + wqG'q Gi = qikbe k + WiG\ (2-9) 

k —1 fc=1 

for each i = 1,..., N, where Wi is the total mass of all other atoms in Gi , which are collected, after normalising 
by Wi, in a random probability measure G'. 

For each i = 1,... ,N and k = 1,..., K *, let ni k be the number of DNA segments in sequence i assigned 
to population k. In the Chinese restaurant franchise representation of the HDP, we introduce a set of 
discrete auxiliary variables niik, taking value 0 if n t k = 0, and values in the range {1,... ,nik} when rijj. > 
0. Define nofc = Then the conditional distributions of the random probability measures given 

[riik, mik)o<i<N,i<k<K* is described by the following [56], 

(qoi, ■ ■ • ,qoK* ,w 0 )\(nik,mik) ~ Dirichlet (n 0 i,..., n 0K *, a 0 ) (2-10) 

(qn, • ■ •, qi.K*, Wi)\{n ik , m ik ), (goi, • • •, do k -), w 0 ~ Dirichlet (aq 0 i + n t i,..., aq 0K > + n 0K *, aw 0 ) 

G' 0 \{nik,mik) ~ DP(ao,if) (2.11) 

G'i\(n ik , mik), G' 0 ~ DP(a, G' 0 ) 

where the masses form a hierarchy of finite-dimensional Dirichlet distributions while the random probability 
measures are independent of the masses and form a hierarchy of DPs as in the prior. 
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A final point of consideration relates to the fact that the random probability measures G' 0 , (G') have 
infinitely many atoms, so not all can be simulated explicitly with finite computational resources. We address 
this using a retrospective slice sampling technique to truncate the random probability measures while retain¬ 
ing exactness [58, 37, 20]. For each individual i, we introduce an auxiliary slice variable G,;, with conditional 
distribution given the other variables in the model: 

9T n = ( 2 - 12 ) 

Ci\(nik,m ik ),Go, (G*) ~ Uniform[0, g™ 111 ]. (2.13) 

The slice variables are sampled just before the latent state sequences (whose updates are described in the 
next subsection). Further, conditioned on the slice variables, populations whose mass fall below Ci will have 
zero probability to be selected when the latent state sequence for individual i is updated. As a consequence, 
only the (finitely many) atoms with mass above the minimum threshold min^ C, need be simulated. This can 
be achieved by simulating G' 0 and (G') using the hierarchical stick-breaking representation (2.4), (2.5) until 
the left-over mass falls below the threshold. 

2-4-2. Updates to latent state sequences 

We will use a forward-filtering backward-sampling algorithm to resample the latent state sequences one at 
a time. Conditioned on the slice variable Ci, only populations with q, k > Ci will have positive probability 
of being selected, and so the forward-backward algorithm can be computationally tractable. However, as 
the slice variable depends on all latent state variables, conditioning on the slice variable introduces complex 
dependencies among the latent state variables which precludes an exact and efficient forward filtering al¬ 
gorithm. We propose instead to ignore the dependencies caused by the slice variable, and use the resulting 
efficient forward-backward algorithm as a Metropolis-Hastings proposal. 

Suppose that there are Ki populations with proportions above the slice threshold Ci- For simplicity of ex¬ 
position, we will reindex the populations such that their indices are simply {1,..., iF, }. The forward-backward 
algorithm efficiently samples from the following proposal distribution which ignores the slice threshold Cp. 

Q((z u , °C P((^,S ! ;;)f = i,G 1 )P((a; ! : i )f = l|( 2 ipS lZ )f =1 |G i ) (2.14) 

where the population indicators range only over 1 ,,Ki- 

The forward filtering phase first computes the following probabilities using dynamic programming: 

Ml l k = P(xii,.. ■, Xu , sn = b,zu = k\(qik,dk)kh) (2.15) 

M% =¥(x n ,...,x i i,z i i = k\(q ik , 0 k ) k ^i) 

M l ! = P(ajji)... ,Xit\(q ik , 9 k ) k ^) 

with b £ {0,1} and k £ {1,..., Ki}. The dynamic programme starts at l = 1: 

A7 k = 6 k \xi±qik 

and proceeds with l = 2, ..., L: 

Mf k = 0 klXil e~ rdl ~' M*- 1 M a k = M» k + M{ l k 

Ki 

= 9 klXil (1 - e"^- 1 )q ik M il - 1 M il 

fc =i 

Recall that 0 k i Xil is the probability that locus l in population k assume the observed value Xu. The backward 
phase samples from the proposal distribution, starting at l = L: 

Q{z lL = k ) cx Mf 
Q(siL = b\z iL = k ) oc M l bk 
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and iterates backwards, for l = L — 1 ,..., 1 : 


Q(zu = k\su + i = 1, zu +1 = k') a 1 (k = k') 

Q(zu = k\su +1 = 0, zu + 1 = k') oc M d k 
Q{su = b\zu = k) a M kk . 

where l(-) denotes the indicator function and sn = 0 by construction. In this way we obtain a new sample 
for the collection of su and zu. Finally, the Metropolis-Hastings acceptance probability is a simple expression 
which accounts for the effect of conditioning on G,y 


min 



min-prop 


(2.16) 


where g™ ln - cur and g™ 111- ? 10 ? indicate the minimum population proportions (2.12) under the current and 
proposed states respectively. 

The forward-backward algorithm has a computational scaling of O(LKi), linear in both the length of 
the sequence and the number of potential populations, and is the most computationally expensive part of 
the MCMC algorithm. It must be noted that, since the ( Gi ) are conditionally independent given Go, the 
algorithm can be easily parallelised so as to exploit modern parallel computation technology. 

2.5. Extensions 

The model can be straightforwardly extended to diploid or polyploid data, by assuming that, for each 
individual i, the z-i along each of individual i’s chromosomes form independent Markov chains satisfying (2.2). 
Other extensions of the Bayesian nonparametric admixture model can be introduced to allow correlated 
allele frequencies. For instance, following the approaches of Pritchard et al. (2000) [43] and Falush et al. 
(2003) [13], it is straightforward to introduce a Bayesian nonparametric admixture model with correlated 
allele frequencies. Specifically, we can assume that allele frequencies in one population provide information 
about the allele frequencies in another population, i.e. frequencies in the different populations are likely 
to be similar (probably due to migration or shared ancestry). This can be achieved by specifying a more 
complex prior structure on 0 k i., for example employing the correlated allele frequencies model of Falush et 
al. (2003) [13], which assumes that allele frequencies at locus l in different populations are deviations from 
allele frequencies in a hypothetical ancestral population. 

At the moment, we use only genetic data to infer admixture parameters. Often it can be useful to include 
in the model extra information such as physical characteristics (e.g. ethnicity) of sampled individuals or geo¬ 
graphic sampling locations [23] . These new sources of information would modify the clustering structure and 
would allow the proportion of individuals assigned to a particular cluster to depend on the new information. 
This would require a specification of a spatiality dependent model on the weights of the random measures 
in the HDP. 

From a Bayesian parametric perspective, we could also employ other priors such as the Pitman-Yor process 
[40] and the hierarchical Pitman-Yor process [57]. The Pitman-Yor process is a two-parameter generalization 
of the DP, for which a stick-breaking construction and a Chinese restaurant representation still hold. Under 
certain assumptions, it can be shown that in the Pitman-Yor process the number of clusters grows much 
faster than for a standard DP and that the cluster sizes decay according to a power law. This property makes 
the Pitman-Yor process a more suitable choice in many applications. The implementation of this more flexible 
prior would require more expensive computations due to the larger number of extant populations possible. 

3. Simulation studies 

In order to assess the performance of the model in recovering the number of ancestral populations, we 
perform simulations based on three demographic scenarios. Each scenario consists of 200 haploid sequences. 



We consider 60 bi-allelic genetic markers on a lOOKbp segment. Coalescent simulations were performed using 
the software ms [24]. Mutation and recombination rates were set to 2 x 10~ 8 and 10~ 8 per base pair per 
generation respectively. The focus is to identify the populations of origin and to infer demographic history. 
In particular, the main goal is inference on K. The three simulated scenarios we consider are: (i) sample 
from a single random mating population; (ii) admixture model with two parental populations and admixture 
proportion of 0.6 and 0.4; (iii) admixture model with three parental populations and admixture proportion 
0.5, 0.4 and 0.1. Figure 5 shows the posterior distribution of K (after burn-in) under the three scenarios. 
HDPStructure successfully recovers the number of ancestral populations. 

In the simulations we have set the parameters for the Gamma priors on a and a o as follows: a ~ 
Gamma(l,l) and a 0 ~ Gamma(5,1). Although inference can be sensitive to the choice of the prior on 
the number of populations, i.e. to the prior specification on a and ao, we note that as the number of 
sequences and/or markers increases the model tends to generate spurious clusters, i.e. clusters with very few 
individuals in them. This is in line with recent results on the clustering properties of the Dirichlet Process 
[31]. Nevertheless, the number of clusters that covers the majority of the data, i.e. 95-99%, is quite robust 
to prior specifications (sensitivity analysis results not shown). In general, the biological interpretation of I\ 
is difficult. This is in agreement with the suggestion of Pritchard et al. (2010) [44]: 

We may not be able to know the TRUE value of K , but we should aim for the smallest value of K that captures the 

major structure in the data. 

We compare our results with the software STRUCTURE [43, 13, 23], which is arguably the most widely 
used software in applications to infer population structure. This software implements the parametric version 
of our model, with fixed K (http://pritchardlab.stanford.edu/structure.html). We have run the 
Linkage model described in Falush et al. (2003) [13], using default settings and assuming independent allele 
frequencies between markers. To estimate K , Pritchard et al. (2000) [43] suggested using the value of K 
which maximises the estimated model log-likelihood, logPr(Data | K ). This latter quantity is estimated by 
the MCMC run using an approximation based on the harmonic mean estimator of the Bayesian deviance. 
For each scenario, we have run STRUCTURE for each value of K, K = 1,..., 8. Figure shows the estimated 
logPr(Data | K ) for the three simulated examples. The value K = 1 seems to maximises the model log- 
likelihood under all scenarios. In general, the authors of the software warn against possible drawback of using 
this criterion and to interpret the results with caution and give suggestions for improvement. 


4. Experiments 

We demonstrate our model on a dataset of 372 Colombians recently genotyped on the Illumina Human610- 
QuadvUB SNP array as part of a genome-wide association study [51]. Latin American samples are uniquely 
advantageous for this purpose [50] because of their well-documented history of extensive mixing between 
Native Americans and people arriving from Europe and Africa. This continental admixture, which has 
occurred for the past 500 years (or about 20-25 generations), gives rise to haplotype blocks which are about 
the right length for such analysis. Ancient admixture produces very short haplotype fragments which are 
hard to assign ancestry with certainty, while very recent admixture allows only large haplotype blocks and 
there is not sufficient variation in ancestry for individuals. 

The Native American population arose as a branch of the East Asian populations who were separated 
over 15,000 years ago and consequently isolation and genetic drift shaped their genetic landscape. This 
caused many SNPs to drift even more than their East Asian counterparts, eventually becoming fixed at 
the alternative allele. The Ectodysplasin-A receptor (EDAR) gene, located on chromosome 2, is a common 
example, in particular SNP rs3827760 [30], whose ancestral A allele is 100% prevalent in European and 
African populations, but the alternative G allele is seen at 94% frequency in Han Chinese and 100% in Native 
Americans. The SNP, a missense mutation, has been observed to have a range of functional effects in humans 
and replicated in other mammals such as mice, including the characteristic straight hair shape in East Asians 
[18, 54] and dental morphology [27, 36]. Our dataset does not contain rs3827760, but neighbouring SNPs in 
LD to rs3827760 are included in the chip panel. This shows a strength of our model does as we manage to 
capture the ancestries even in absence of SNP rs3827760, the well-known causal and ancestry-informative 
SNP, by making good use of LD information. Figure 5 shows the LD plot for the EDAR region in the 
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Correlation 

European anc. 

Amerindian anc. 

African anc. 

Cluster 1 (Europe) 

0.90 

-0.67 

-0.36 

Cluster 2 (America) 

-0.75 

0.92 

-0.20 

Cluster 3 (Africa) 

-0.36 

-0.19 

0.76 

Cluster 4 (New) 

0.04 

-0.24 

0.28 


Table 1 


Correlation between ancestry proportions and cluster occurrence proportion from the Bayesian nonparametric model. 


Colombian samples. Overall, EDAR signalling acts during prenatal development to specify the location, size 
and shape of ectodermal appendages, such as hair follicles, teeth and glands [30]. Therefore, we considered 
EDAR to be an interesting candidate for admixture analysis as it carries information regarding ancestry due 
to its variation across ancestry as well as its range of functional effects, which means it may be showing some 
signal of selection. 

Genotype information on 372 individuals for 16 SNPs in the EDAR region was available from our Illumina 
chip data. Genotypes were phased for conversion to haplotype format by Shapelt2 [10]. Data from a total of 
828 individuals sampled in putative parental populations were used as reference ancestral groups. These were 
selected from HAPMAP, the CEPH-HDGP cell panel [28] and from published Native American data [48] as 
follows: 169 Africans (from 5 populations from Sub-Saharan West Africa), 299 Europeans (from 7 West and 
South European populations) and 360 Native Americans (from 47 populations from Mxico Southwards). 

We ran the MCMC sampler for 50,000 iterations. We collected samples after a burn-in of 20,0000 iterations 
and thinned every 30 iterations. We specify the following prior distributions for the precision parameters in 
the HDP: ao ~ Gamma(l, 1), a ~ Gamma(10,20). We centre the prior for the mean parameters of the Beta 
base measure of the HDP around the overall observed allele frequencies, with c = 0.01. The prior for logr is 
a Uniform on the interval [—500, 5]. 

The posterior analysis shows evidence of four major ancestral populations in the set of 744 Colombian 
haplotypes (see Figure 5). We use the MCMC output to estimate the cluster assignment, i.e. population 
allocation, to each of the 4 major ancestral populations for each haplotype sequence and each marker. In 
Figure 5, we summarise the MCMC output by reporting the clustering that minimizes the posterior expec¬ 
tation of Binder’s loss as described by Fritsch and Ickstadt (2009) [17], who also discuss possible alternatives 
such as Maximum a posteriori clustering. The four major clusters have admixture coefficients, i.e. relative 
proportion of occurrences of each of the clusters, 51.8%, 32.1%, 11.4% and 4.7% respectively. As we have 
used a reference panel, we are able to identify in the first cluster, in terms of cardinality, European-origin 
haplotypes in the sampled Colombians. The second and third clusters correspond to Amerindian and African 
respectively. This is also confirmed by looking at the “most frequent” haplotype in each cluster. Figure 5 
shows the raw data assigned to each of the four major ancestral populations. We verify our findings in 
two ways. Firstly, we calculate genetic ancestry proportions using reference genotypes as reference ancestral 
groups. EDAR-specific ancestry proportions for each of the 372 Colombian samples were estimated using 
Admixture software [2], which provides a faster implementation of the same model in STRUCTURE. We 
correlated these ancestry proportions to our cluster occurrence proportion (see Table 4). The correlation 
values are very high and support our assignment of ancestry category to the first three clusters. The average 
European, Amerindian and African ancestry across Colombian samples are 53.6%, 30.8% and 16.6% respec¬ 
tively, which is also very close to our cluster proportions. However, Admixture is a supervised approach and 
so it cannot give us further details about the fourth, rarer cluster. To explore it further, and also for another 
line of verification, we calculate genetic principal components, in which SNP genotypes for each person is 
recoded into 0/1/2 by an additive count of the minor allele on two chromosomes, and this SNP genotype 
count matrix is converted to principal components (PC) via the usual method. As European-1-Amerindian 
continental genetic mixing is the primary source for admixture in our data, the first PC axis reflects this, 
being positively correlated with European samples and negatively with American samples. The second PC 
captures the other continental component in our samples, namely the African samples. More specifically, 
PCI captures the European-Amerindian axis of variation, and PC2 captures the African-European axis. In 
Table 4 we show correlations of PCs with supervised ancestry values. As further PCs are orthogonal to these, 
they do not show high correlation with any ancestry component. Consequently, the first PC shows high cor¬ 
relations with the first two clusters, and PC2 with the third cluster. As shown in Table 4 the third PC is 
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European anc. 

Amerindian anc. 

African anc. 

PCI 

0.81 

- 0.95 

0.15 

PC2 

-0.60 

-0.04 

0.90 

PC3 

-0.09 

0.00 

0.13 

PC4 

0.15 

0.03 

-0.25 


Table 2 

Correlations between principal components and supervised ancestry values. 


Correlation 

PCI 

PC2 

PC3 

PC4 

Cluster 1 (Europe) 

0.73 

-0.58 

-0.26 

0.24 

Cluster 2 (America) 

-0.90 

0.04 

0.02 

-0.01 

Cluster 3 (Africa) 

0.06 

0.72 

-0.07 

-0.12 

Cluster 4 (New) 

0.24 

0.31 

0.71 

-0.40 


Table 3 

Correlations between principal components and cluster assignment. 


highly correlated with the new cluster, which validates the signal we capture as genuine genetic component 
and not a statistical artefact of our method. To investigate the genetic source of the new cluster, we look at 
the average allele frequency for each SNP in all the clusters, and then take the difference for the new cluster 
vs. all the others. Figure 5, top panel, shows the absolute differences: we see clearly that only SNPs 11 and 
13 primarily contribute to this cluster. The same is seen when we plot the weights given by PC3 onto each 
SNP (Figure 5 bottom panel). These two SNPs - rs260693 and rs260696 - are rare SNPs, i.e. their minor 
allele which is recognized in cluster 4 and PC3 is rare. For example, rs260693 has a global MAF of 3.8%, with 
the minor allele only primarily seen in Europe (9%) while nearly being absent in Africans (1.8%) and East 
Asians (0%). Their two minor alleles are highly in LD, with a D’ of 1 in European populations. This shows 
that the haplotype that contains the two minor alleles for these two SNPs is also rare, and is being identified 
as the fourth separate cluster by our model. Table 4 reports the posterior mean of the Oki of the 16 SNPs in 
each of the four major populations while Figure 5 shows the posterior mean of admixture proportion itik for 
few randomly selected individuals. 

5. Discussion and concluding remarks 

We have presented the Bayesian nonparametric counterpart of the Linkage model of Falush et al. (2003) [13] 
to infer genetic admixtures. The model allows for both the number of ancestral population and the assignment 
vector to be random, avoiding the use of model selection criteria. The model can be applied to commonly 
used genetic markers and does not rely on specific assumption on the mutation model. We incorporate 
dependence between markers due to correlation of ancestry by specifying an inhomogeneous Poisson process 
on the DNA sequence. Each population is modelled using a simple and independent allele-frequency profile. 
We have developed an MCMC algorithm which allows us to perform posterior inference on the number 
of ancestral populations, the population of origin of chromosomal regions, the proportion of an individual’s 
genome coming from each population, the admixture proportions in the population and the allele frequencies 
in ancestral populations. We have demonstrated the model in simulations and on real data from the EDAR 
gene. The model has been able to highlight the existence of a rare European haplotype. We have highlighted 
possible extensions to our method. An interesting direction for future development is to relax the assumption 
of independent allele-frequency profile in each population by incorporating ideas from Sohn et al. (2012) [53] 
and model each population as a hidden Markov model over a set of founder haplotypes. 

In this article we have devoted considerable attention to inferring K and shown how Bayesian nonparamer- 
tic methods automatically provide posterior inference on the number of ancestral populations. Nevertheless, 
we must be careful when interpreting K. The nonparametric setup will generally yield sensible clustering 
but clusters will not necessarily correspond to “real” populations. This problem is in common with most of 
the model-based structure algorithms [43]. 
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Europe 

Africa 

America 

Rare 

rs6542783 

0.963 

0.998 

0.580 

1.000 

rs2378217 

0.900 

0.982 

0.299 

0.000 

rs940928 

0.907 

0.981 

0.452 

0.000 

rs260633 

0.835 

0.973 

0.289 

0.000 

rs260642 

0.877 

0.968 

0.539 

0.001 

rs6542787 

0.129 

0.956 

0.495 

1.000 

rs260714 

0.073 

0.968 

0.822 

0.000 

rs260698 

0.063 

0.970 

0.624 

0.000 

rs260690 

0.971 

0.036 

0.388 

1.000 

rsl3427084 

0.949 

1.000 

0.798 

1.000 

rs260693 

0.943 

0.996 

0.985 

0.000 

rs11691107 

0.973 

1.000 

0.638 

1.000 

rs260696 

0.808 

0.990 

0.916 

0.000 

rsl2992554 

0.348 

0.816 

0.865 

1.000 

rsl7269487 

0.369 

0.823 

0.872 

1.000 

rs260681 

1.000 

1.000 

0.738 

1.000 


Table 4 

Posterior mean of O^i, l = 1,.... 16 in each of the four major populations. The colour gradient in each cell is proportional to 

the numerical value. 


Appendix 


We develop a Gibbs sampler algorithm, in which each variable is updated given the remaining variables fixed 
at their current value. 


Update of Gq. See section 2.4 of main manuscript. 

Update of Gi. See section 2.4 of main manuscript. 

Update of and irtik ■ Assume that at iteration w there are K populations in the model. We need to 
resample the seating arrangement of the Chinese restaurant of Gi. Updating is straightforward as 
the new sample is simply the number of linked segments with zu = k. That is, 

L 

n ik = ^2 1 (su = 0)1 (zu = k). 
i=i 

Then, niik can be sampled from the distribution of the number of tables in Chinese Restaurant Process 
with customers and mass parameter ag 0 fc- That is, 


‘TPik 


bj 


nik 




Bernoulli ( -—-) 

\aqik + j ~ 1/ 


where bj = 1 if customer j joins a new table. 

Update of 9u- Assume that at iteration w there K populations in the model. The posterior p(9ki) | rest) 
is the same as in a simple parametric Bayesian model using as observations all the markers for which 
Zu = k. In the case of SNP data the conditional posterior of 9ki is a Beta distribution. 

Update of a. The concentration parameter ao governs the distribution of the number of of 9pi s in each 
mixture. We follow Teh et al. (2006) [56]. Assume that at iteration w there K populations in the 
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model. Let to.. = k rriik and n*. = Y2k nik - We introduce latent variable Wi £ [0,1] and ti £ {0,1}, 
i = 1,..., N, with 


Wi | a 0 


ti | Otfl 


oc 


Beta(ag + U ?r) 



If a is given a T(a,b) hyperprior, then given Wi and ti, a has a Gamma distribution with parameters 
a + to.. - J2i =l ti and b - X)i=i log Wi- 

Update of ao- Given the total number to.. = Y2i k °f the 9ki s, the concentration parameter ag governs 
the distribution of the number of population K. We use the auxiliary method of Escobar and West [11]. 
If cto is given a r(o, b) hyperprior, it can be resampled by introducing a latent variable 7 ~ Beta(ao + 
1 ,to..) and 

p(a 0 | K, 7 ) = 7 iT(a + K, b - log( 7 )) + (1 - 7 r)r(a + K - 1, b - log( 7 )) 
where 7 r/(l — n) = (a + K — 1 )/m..(b — log( 7 )). 

Update of r. The rate r of the Poisson process is given a Uniform prior on some interval [r^, rjj\- We use a 
random walk Metropolis step to update r with proposal distribution centred around the current value. 

Update of hyperparameters in the base measure H . The proportion of the model involving the hy¬ 
perparameters in H is a convential parametric model. Hence, conditioning on all the other variables 
usually leaves us with a standard Bayesian model, often in conjugate form. In the case of SNP data, 
we have taken 9k ~ H = TIcLi Beta(c/i/, c(l — pi)). We specify independent Beta(cp, 6 i) for each pi, 
l = 1,... ,L, and update pi using a random walk Metropolis step with proposal distribution centred 
around the current value. 
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Fig 1. Pr(iF | Data ) under the three simulated scenarios. 
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Fig 3. LD plot for the EDAR region in the Colombian samples. 
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Fig 4. EDAR data: posterior distribution of the number of clusters K. 
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Fig 5. The top panel shows the genetic data, with black representing the 0 allele for each SNP. The bottom panel presents 
summary of the posterior population assignment obtained by minimising the Binder loss function. 
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CLUSTER 2 




Fig 6 . Each panel shows the data distribution for each of the four major populations: white indicates markers on each sequences 
not assigned to the specific population, grey denotes the 1 allele and black denotes the 0 allele. 
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Fig 7. Top panel: absolute value of the difference between the average allele frequency for each SNP in all the clusters and the 
frequency in the rare cluster. Bottom panel: loadings for each SNP of the third PC. 
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Fig 8. Posterior admixture proportions for randomly selected haplotypes. 
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